ABSTRACT


This paper attempts to quantify the accuracy limit of “next- 
item-correct” prediction by using numerical optimization to 
estimate the student’s probability of getting each question 
correct given a complete sequence of item responses. This 
optimization is performed without an explicit parameterized 
model of student behavior, but with the constraint that a 
student’s likelihood of getting a problem correct only in- 
creases or remains unchanged with additional practice (i.e., 
no forgetting). We present results for this method on the
Assistments 2009-2010 data, where it suggests that there is
only modest opportunity for improvement beyond
state-of-the-art predictors. Furthermore, we describe a framework
for applying this method to datasets where problems
can be tagged with multiple skills and problem difficulties.
Lastly, we discuss the limitations of this method, specifically 
its inability to give tight bounds on short sequences. 


1. INTRODUCTION 


Student modeling is a fundamental building block of educa- 
tional systems that are intelligent or adaptive. With a model 
of a student, such a system can consider all of the actions 
it has available and make a prediction about which ones are 
likely to be the most profitable for a particular student at 
the current time. 


One class of student models tries to predict next-item-correct,
i.e., the probability that a student’s attempt on the
next item presented will be correct given the student’s results
on all previous items. For a number of years, this topic
saw vigorous research, with non-trivial gains from
improved model parameterizations [1, 6, 7, 11] and recurrent
neural networks [10]. Yet the performance of next-item-correct
predictors seems to have reached an asymptote that is far
below perfect prediction.


This gap between the current state of the art and perfect 
prediction raises the question of how much headroom remains
for further improvements to next-item-correct prediction.
Previous work by Beck and Xiong [2] has attempted
to characterize the accuracy limit by analyzing the perfor- 
mance of a collection of “cheating” prediction algorithms 
that employ a partial knowledge of future results. They 
conclude that further large improvements in prediction ac- 
curacy are unlikely. 


Estimating a tight bound to prediction accuracy is challeng- 
ing, because one needs to utilize some information about 
future correctness without merely regurgitating the stream 
of actual outcomes as one’s predictions, which would yield 
the tautological bound of 100% accuracy. Beck and Xiong
navigate this conundrum by allowing their cheating model
to correctly predict the transitions from an incorrect
response to a correct response (e.g., learning), but
not those from a correct response to an incorrect
response (in their words, “forgetting”). We found
this approach unsatisfying in two respects. First, the
time period over which the data is collected is too short for
true forgetting to take place; such transitions are more likely
to be slips, so we feel that the model is a mismatch for the
phenomenon at hand. Second, perfectly predicting
incorrect-to-correct transitions but not correct-to-incorrect
transitions seems arbitrary.


Instead, we posit that the limits of accuracy for next-item- 
correct prediction derive from the fact that learning is not 
a binary transition from a state of not knowing to a state 
of knowing, but rather that there is a continuum of knowl- 
edge levels that a student could be at. For example, there 
is a point on this continuum where a student will get 50% 
of the problems attempted correct and the other 50% in- 
correct. The challenge for next-item-correct prediction for 
such a student is precisely determining whether the next at- 
tempt will be correct or incorrect, much like the hopeless 
task of trying to consistently predict the outcome of flipping 
a fair coin. More precisely, it is the student responses as 
they transition from not knowing to knowing that are hard 
to predict, as the behavior of perfectly knowledgeable and 
perfectly unknowledgeable students is trivial to predict. 


Thus, the limit for prediction should primarily derive from 
the fraction of a data stream during which students are in 
this transitional phase where they are intermingling correct 
and incorrect responses. This can be viewed as the amount 
of entropy in the data, and this entropy can and does vary 
from dataset to dataset. As such, we believe that a method
that can estimate the limits of predictability as a function
of this entropy can serve as a less arbitrary estimate of the
accuracy limit for next-item-correct prediction and as
a useful means for characterizing and comparing datasets.


Figure 1: Illustrative example of the input to the
next-item-correct prediction problem. For this example,
$n = 10$ and $x_1,\dots,x_n = 0,0,1,0,0,1,1,0,1,1$.


This paper is organized as follows. We first formalize the 
next-item-correct prediction problem in Section 2. We then 
describe our model-free bounding method in Section 3. We 
show experimental results of our method in Section 4. Fi- 
nally, we discuss the limitations of our method in Section 5 
and future directions in Section 6. 


2. NEXT-ITEM-CORRECT PREDICTION 


We formalize the next-item-correct prediction problem as
follows. We are given a length-$n$ sequence $x_1,\dots,x_n$, where
$x_i = 1$ if the student answered the $i$th attempted item correctly
and $x_i = 0$ otherwise, as shown in Figure 1. Given this
information, we want to produce $n$ reals $p_1,\dots,p_n$, where
$p_i$ is the probability of the student answering the $i$th
attempted item correctly. Typically, models are required to
produce $p_1,\dots,p_n$ in order and are only allowed to
look at $x_1,\dots,x_{i-1}$ when producing $p_i$, as future observations
should not be available during prediction. Some of the
notable models for this task are Bayesian Knowledge Trac- 
ing (BKT) [3], Performance Factor Analysis (PFA) [8], and 
Deep Knowledge Tracing (DKT) [10]. 


In efforts to improve their performance, many models use
the knowledge components required by each item, denoted
as $\mathbf{s}_1,\dots,\mathbf{s}_n$. Each $\mathbf{s}_i$ is a $d$-dimensional vector, where $d$
is the number of knowledge components in the corresponding
dataset. Each entry of $\mathbf{s}_i$ is typically boolean, indicating
whether the item requires the corresponding knowledge
component. The entries of $\mathbf{s}_i$ can be real valued as well,
indicating the degree of mastery required on each component
in order to answer the item correctly.


Figure 2: Results of the model-free bounding
method when all items require the same knowledge
component.


With the ground truth $x_1,\dots,x_n$ and the predictions of a model
$p_1,\dots,p_n$, a performance metric $\mathcal{L}$ is typically used to measure
how good the predictions are. The most widely used
metrics for this task are root mean squared error (RMSE)
and area under the curve (AUC) [9]. Log likelihood (LL) has
also been proposed [9], though it has not been widely used on
this task. This paper uses average LL instead of LL since
the former does not depend on the size of the data. Models
with better $\mathcal{L}(p_1,\dots,p_n; x_1,\dots,x_n)$ are to be preferred.
The meaning of “better” depends on the metric: larger values
are better for average LL and AUC, while smaller values
are better for RMSE.
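As a minimal sketch of the two metrics that can be computed term by term, RMSE and average LL might be evaluated as follows. The function names and the clipping constant `eps` are assumptions made for this illustration only; they are not part of the formalization above.

```python
import math

def rmse(p, x):
    """Root mean squared error between predictions p and 0/1 outcomes x."""
    return math.sqrt(sum((pi - xi) ** 2 for pi, xi in zip(p, x)) / len(x))

def average_log_likelihood(p, x, eps=1e-12):
    """Mean Bernoulli log likelihood of outcomes x under predictions p.

    Clipping with eps (an assumption of this sketch) avoids log(0).
    """
    total = 0.0
    for pi, xi in zip(p, x):
        pi = min(max(pi, eps), 1.0 - eps)
        total += math.log(pi) if xi == 1 else math.log(1.0 - pi)
    return total / len(x)

x = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]   # the sequence from Figure 1
p = [0.5] * len(x)                   # an uninformative predictor
print(rmse(p, x))                    # 0.5
print(average_log_likelihood(p, x))  # log(0.5), roughly -0.693
```

Smaller is better for the first value and larger is better for the second, matching the convention described above.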


3. MODEL-FREE ACCURACY BOUNDS 


The core idea of our method is that the probability of a stu- 
dent correctly answering items that require the same knowl- 
edge components should be non-decreasing over the short 
term. More precisely, if the current item is no more difficult 
than a previous item that requires the same knowledge and 
there hasn’t been sufficient time or interference for forget- 
ting to occur, the student’s probability of getting the current 
item correct should be at least as high as the previous item. 


This idea is illustrated in Figure 2, where the dashed line seg- 
ments correspond to the probability of the student correctly 
answering each item. One could interpret this sequence as 
having three phases: (1) items 1 and 2 as a region of unknow- 
ing where the student gets every item incorrect, (2) items 3 
through 8 as a region of learning where correct and incorrect 
responses are interleaved, and (3) items 9 and 10 as a region 
of mastery where the student gets every item correct. Even
though the second region includes both correct and incorrect
responses, we interpret those merely as events drawn
from an underlying probability distribution in which the
probability of a correct response is non-decreasing throughout the
sequence.


Based on this idea, our proposed bounding method finds
correctness probabilities for each item $p^*_1,\dots,p^*_n$ that optimize
$\mathcal{L}(p^*_1,\dots,p^*_n; x_1,\dots,x_n)$ subject to the constraint that
the $p^*_i$ sequence is non-decreasing on appropriate item
sequences. These $p^*_i$ provide the best local estimate of the
likelihood that a student will get an item correct given an
assumption that only learning is occurring. To do better,
one would have to predict the precise sequence of correct
and incorrect responses, and we believe that this problem is
akin to predicting the precise sequence of heads and tails
from repeated flips of a coin. As such, we expect this to be
a practical bound on next-item-correct prediction.


We refer to this method as being “model free” because it
does not rely on any parameterized model of student behaviors
and does not require training. Instead, the $p^*_i$ values are
derived directly from the sequence $x_1,\dots,x_n$ and, therefore,
the method can potentially be applied to any dataset.


3.1 Single knowledge component case 


Before diving into the case where multiple knowledge components
are involved, we first explain our method in the
simplest case, where all items in the sequence require the same
knowledge component. In this case, since all of the items
are equivalent in terms of knowledge components, the aforementioned
constraint is equivalent to constraining $p_1,\dots,p_n$
to be non-decreasing. Thus our method reduces to solving
the following numerical optimization problem to obtain
$p^*_1,\dots,p^*_n$:


$$
\begin{aligned}
\text{optimize:} \quad & \mathcal{L}(p_1,\dots,p_n; x_1,\dots,x_n) \\
\text{subject to:} \quad & 0 \le p_i \le 1 \text{ for all } i, \\
& p_i \le p_j \text{ for all } i < j.
\end{aligned}
\tag{1}
$$


This numerical optimization problem can be solved efficiently
by an interior point method if (1) $\mathcal{L}$ is convex and smaller
$\mathcal{L}$ is better, or (2) $\mathcal{L}$ is concave and larger $\mathcal{L}$ is better. Of
the three metrics mentioned previously, average LL and
RMSE satisfy this criterion, while AUC is not even continuous
(and hence neither convex nor concave). Thus this formulation
as a numerical optimization problem is only applicable
when $\mathcal{L}$ is average LL or RMSE. There are various tools
that can solve this sort of numerical optimization problem.
In our implementation, we used Matlab’s fmincon with
L-BFGS as the Hessian approximation method.


To give a sense of what this method produces, the dashed line in
Figure 2 shows the values $p^*_1,\dots,p^*_n$ that minimize RMSE
for the given observed item responses $x_1,\dots,x_n$ (solid black
dots).
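For readers who want to reproduce this kind of fit without an interior point solver, the sketch below substitutes the pool-adjacent-violators algorithm, a standard technique for least-squares fitting under a non-decreasing constraint. It is a stand-in for the optimizer described above, not the implementation used in the experiments, and the function name `isotonic_fit` is hypothetical.

```python
def isotonic_fit(x):
    """Least-squares non-decreasing fit to x via pool-adjacent-violators.

    Stand-in for the numerical optimizer described in the text.
    """
    blocks = []  # each block is [sum of values, number of values]
    for value in x:
        blocks.append([float(value), 1])
        # Merge while the previous block's mean is not below the last one's.
        # Means are compared by cross-multiplication to avoid division.
        while len(blocks) > 1 and (
            blocks[-2][0] * blocks[-1][1] >= blocks[-1][0] * blocks[-2][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

x = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]  # the sequence from Figure 1
print([round(v, 3) for v in isotonic_fit(x)])
# [0.0, 0.0, 0.333, 0.333, 0.333, 0.667, 0.667, 0.667, 1.0, 1.0]
```

On this sequence the fit is 0 for items 1 and 2, rises through 1/3 and 2/3 on items 3 through 8, and reaches 1 on items 9 and 10.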


3.2 Partial order of items


In order to handle sequences of items with different combinations
of multiple knowledge components, we need to be
able to compare the items and decide which previously attempted
items provide information useful for predicting the
outcome of the current item. The intuition is that if item $a$
is of the same difficulty as or easier than item $b$ with respect to
the required knowledge components, then a student should
do at least as well on item $a$ as on item $b$. We compare items by
defining a partial order $\preceq$ over the knowledge component
vectors as follows:


$$
\mathbf{s}_a \preceq \mathbf{s}_b \iff s_{a,k} \le s_{b,k} \text{ for all } k, \tag{2}
$$


where $s_{a,k}$ is the $k$th coordinate of $\mathbf{s}_a$. This partial order
essentially states that item $a$ should be considered easier
than or equal to item $b$ if the required mastery level of each
knowledge component of item $a$ is less than or equal to that
of item $b$. Intuitively, given $\mathbf{s}_a \preceq \mathbf{s}_b$, a student should
be able to answer item $a$ correctly if the student can answer
item $b$ correctly.


Given this definition of partial order, we can induce a directed
acyclic graph (DAG) on the set of items, where there
is an edge from the $j$th item to the $i$th if and only if $i < j$
and $\mathbf{s}_j \preceq \mathbf{s}_i$. The intuition behind the requirement $i < j$ is that
being able to solve a “harder” item in the past implies being
able to solve an “easier” item in the future. However, being
able to solve a “harder” item in the future does not imply
being able to solve an “easier” item in the past, since the student
might have learned a lot in between. To illustrate this,
we show the DAG induced by a sequence of 6 items with 3
knowledge components in Figure 3. In such a DAG, an edge
from the $j$th item to the $i$th means that the student should
be able to do the $j$th item at least as well as the $i$th item.
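The construction of the induced graph can be sketched in a few lines. The helper names and the four-item example below are hypothetical, and indices are 0-based, so an edge `(j, i)` stands for the constraint that the later item `j` is done at least as well as the earlier item `i`.

```python
def easier_or_equal(s_a, s_b):
    """True if s_a precedes s_b: every coordinate of s_a is <= that of s_b."""
    return all(a <= b for a, b in zip(s_a, s_b))

def induced_edges(skills):
    """Edges (j, i), 0-indexed, with i < j and skills[j] no harder than skills[i]."""
    return [
        (j, i)
        for j in range(len(skills))
        for i in range(j)
        if easier_or_equal(skills[j], skills[i])
    ]

# Hypothetical 4-item sequence over 2 knowledge components.
skills = [(1, 1), (1, 0), (0, 1), (1, 0)]
print(induced_edges(skills))  # [(1, 0), (2, 0), (3, 0), (3, 1)]
```

Items 1 and 2 in this example are incomparable, since each requires a component the other does not, so no edge joins them.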


Figure 3: A directed acyclic graph induced by the
partial order. An arrow from the $j$th item to the
$i$th item means that the student should do the $j$th
item at least as well as the $i$th item. There are two
connected components in this induced graph, which
are $\{x_1, x_2, x_3, x_6\}$ and $\{x_4, x_5\}$.


3.3 Multiple knowledge components case


Given the partial order on items as described above, we can
generalize the non-decreasing constraints for a single knowledge
component to handle any combination of knowledge
components. Specifically, given $i < j$ and $\mathbf{s}_j \preceq \mathbf{s}_i$, the probability
$p_j$ of the student answering the $j$th item correctly
should not be lower than the probability $p_i$ for the $i$th item,
since the $j$th item is no harder than the $i$th item. That
is, $p_i \le p_j$ when there is an edge from the $j$th item to the
$i$th item in the induced DAG on the sequence. Thus the
optimization problem can be reformulated as


$$
\begin{aligned}
\text{optimize:} \quad & \mathcal{L}(p_1,\dots,p_n; x_1,\dots,x_n) \\
\text{subject to:} \quad & 0 \le p_i \le 1 \text{ for all } i, \\
& p_i \le p_j \text{ for all } i < j \text{ that satisfy } \mathbf{s}_j \preceq \mathbf{s}_i.
\end{aligned}
\tag{3}
$$


This complicated optimization problem can usually be broken
down into smaller ones by dividing the sequence $x_1,\dots,x_n$
into shorter subsequences based on the connected components
they belong to in the induced DAG. In the example
depicted by Figure 3, there are two connected components,
which correspond to $\{x_1, x_2, x_3, x_6\}$ and $\{x_4, x_5\}$. We can
then optimize on each subsequence separately.


Another trick to accelerate the optimization is removing redundant
constraints, since the partial order is transitive. For
example, the constraint corresponding to the edge from $\mathbf{s}_3$
to $\mathbf{s}_1$ in Figure 3 can be safely removed since it is implied
by the constraints corresponding to $\mathbf{s}_3 \preceq \mathbf{s}_2$ and $\mathbf{s}_2 \preceq \mathbf{s}_1$.
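Both simplifications can be sketched as small graph routines. The function names and the edge list below are hypothetical and do not reproduce Figure 3. Edges are written `(j, i)` with 0-based indices, and the redundancy check assumes the edge set is transitively closed, as it is for a graph induced by a partial order.

```python
def connected_components(n, edges):
    """Group item indices 0..n-1 by connected component, ignoring direction."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for j, i in edges:
        parent[find(j)] = find(i)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())

def drop_implied_edges(edges):
    """Remove edge (j, i) when some k gives both edges (j, k) and (k, i)."""
    edge_set = set(edges)
    nodes = {v for edge in edges for v in edge}
    return [
        (j, i)
        for j, i in edges
        if not any((j, k) in edge_set and (k, i) in edge_set for k in nodes)
    ]

# Hypothetical edges (j, i), each meaning p_i <= p_j.
edges = [(1, 0), (2, 0), (2, 1), (4, 3)]
print(connected_components(6, edges))  # [[0, 1, 2], [3, 4], [5]]
print(drop_implied_edges(edges))       # [(1, 0), (2, 1), (4, 3)]
```

An item with no edges, such as index 5 here, forms its own component and is constrained only by the bounds on its probability.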


3.4 Metrics that cannot be directly optimized 


As mentioned before, our method is not directly applicable to AUC
since it is not continuous. To compute a bound for AUC, we
first solve the optimization problem by either maximizing
average LL or minimizing RMSE. Once we have obtained $p^*_i$ for
the entire dataset, we can calculate AUC using these $p^*_i$.


In general, we can always optimize on one metric $\mathcal{L}$ for $p^*_i$
and evaluate the $p^*_i$ with any metric $\mathcal{L}'$, even though the
optimization is done with respect to $\mathcal{L}$. We refer to this as
the bound obtained by optimizing $\mathcal{L}$.
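As an illustration of evaluating one metric on probabilities obtained by optimizing another, the sketch below computes AUC by pairwise comparison, counting ties as one half. The function name is hypothetical, and the probabilities used are the non-decreasing least-squares fit to the Figure 1 sequence, worked out by hand for this example.

```python
def auc(p, x):
    """Chance that a random correct attempt is scored above a random incorrect one.

    Ties between scores count as one half.
    """
    pos = [pi for pi, xi in zip(p, x) if xi == 1]
    neg = [pi for pi, xi in zip(p, x) if xi == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one correct and one incorrect attempt")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

x = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]              # the sequence from Figure 1
p = [0, 0, 1/3, 1/3, 1/3, 2/3, 2/3, 2/3, 1, 1]  # monotone least-squares fit to x
print(auc(p, x))  # 0.88
```

The pairwise form is quadratic in the number of attempts, which is acceptable for a sketch; a rank-based computation would be preferable on a full dataset.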


4. EXPERIMENTAL RESULTS 


We applied BKT, DKT, and our method to the Assistments
2009-2010 dataset. We chose this dataset because it has
relatively long sequences of attempts. We used the same
train/test split for this dataset as in Khajah et al. [5]. We
used the BKT implementation by Yudelson [11] with the
default parameters and Baum-Welch as the training method.
We used Khajah et al.’s [5] implementation of DKT with
default parameters. We applied our method only to the test
set so that the comparison is meaningful.


Figure 4: Results of applying BKT, DKT, and our method to the Assistments 2009-2010 dataset.


For the rest of this paper, we only report bounds obtained 
by maximizing average LL. Throughout our experiments, 
we found that the bounds for all of average LL, RMSE, and 
AUC obtained by minimizing RMSE differed by less than 
0.5% from those obtained by maximizing average LL. In fact, 
it can be proved that minimizing RMSE and maximizing
average LL yield the same $p^*_i$ in the single knowledge
component case (Equation 1). See the Appendix for the
proof.


We show our results on Assistments 2009-2010 for average 
LL, RMSE, and AUC in Figure 4. The performance of DKT 
is roughly halfway between BKT and the bound produced
by our method for all of the metrics. This suggests that the 
room for further improvements on Assistments 2009-2010 is 
limited. 


5. LIMITATIONS 


The major limitation of our method is its optimistic nature,
meaning that it can produce a bound that is too loose. This
optimism manifests in two ways. First, our method can predict
the precise location of learning transitions, which will
be difficult for any realistic model. Second, and more generally,
when the sequence of predictions to be made is short,
the optimization isn’t significantly constrained.


5.1 Predicting Particular Events 


Figure 5: Two sequences that our method predicts
perfectly. A real predictor, however, might have
trouble predicting the precise location of the upward
transition.


Figure 6: Our method can predict initial behavior
perfectly in some circumstances. A real predictor,
however, might have trouble predicting precisely
which students would get a problem correct
on their first attempt.


The proposed technique appears to provide a reasonable
bound on prediction performance when student behavior
reflects non-instantaneous learning of a topic, with an
interleaving of correct and incorrect responses as shown in
Figure 1. However, when students transition instantly from
consistently answering incorrectly to consistently answering
correctly, the method will likely produce a bound that is too
loose. Consider the item response sequences of two students
shown in Figure 5. Both of these students only transition
from incorrect responses to correct responses, meaning that
the optimization is free to generate predictions that precisely
match the data, resulting in 100% accuracy. A real model,
however, must predict the point of the transition, know- 
ing that after observing the first three incorrect responses it 
should predict correct for the first student’s fourth attempt 
and incorrect for the second student’s fourth attempt. While 
it isn’t impossible to imagine that there are features to guide 
such a prediction, it is difficult to believe that it could be 
done consistently with 100% accuracy. 
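In the single knowledge component case, the condition under which the bound becomes trivially perfect is easy to state: the observed responses are themselves non-decreasing, so setting each probability equal to the observed response satisfies every constraint. A minimal check, with a hypothetical function name and hypothetical sequences in the spirit of Figure 5, might look like this:

```python
def fit_is_perfect(x):
    """True when the 0/1 sequence x is already non-decreasing.

    In that case p = x is feasible for Equation 1 and has zero error.
    """
    return all(a <= b for a, b in zip(x, x[1:]))

print(fit_is_perfect([0, 0, 0, 1, 1, 1]))  # True: transition on the fourth attempt
print(fit_is_perfect([0, 0, 0, 0, 1, 1]))  # True: transition on the fifth attempt
print(fit_is_perfect([0, 1, 0, 1]))        # False: a correct response precedes an incorrect one
```

The two sequences that pass share the same first three responses, which is exactly the information a real predictor would have when making its fourth prediction.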


A special case of predicting such a transition is predicting 
whether or not the very first attempt is going to be cor- 
rect. As shown in Figure 6, our method can perfectly pre- 
dict whether or not a student gets their first attempt correct, 
provided the student gets all other attempts correct. A real 
system might be challenged to predict precisely which students
would perform in this manner, although some knowledge
about the students will certainly enable such predictions
to be performed at a rate better than just the average
frequency with which students get a given question correct on their
first attempt. Nevertheless, these features of the data lead
our method to be optimistic, and they occur more
frequently and have a larger impact on short sequences.


Figure 7: Upper bounds produced by our method versus theoretical bounds for attempt results that are i.i.d.
with fixed $q$ for various sequence lengths. The solid curves correspond to the results of our method and the
dashed lines correspond to the theoretical bounds.


5.2 Short Sequences 


In general, our method struggles with short sequences, be- 
cause the optimization is largely unconstrained. For exam- 
ple, consider the case where every student has made exactly 
one attempt. In such a case our method will always produce 
$p_1$ that is exactly the same as $x_1$, which results in a trivial
bound of 100% accuracy. However, as the sequence length 
increases, the constraints will generally prevent our method 
from being perfectly accurate, and thus it will provide a 
more useful bound. 


To understand how the amount of optimism in our method
depends on the sequence length, we studied sequences of independent
and identically distributed (i.i.d.) coin tosses. Such
sequences allow us to compute a theoretical bound that we
can compare to the one produced by our method. When
attempt results $x_1,\dots,x_n$ are i.i.d. with probability $q$ of
being correct, the theoretical bound is $q \log q + (1-q)\log(1-q)$
for average LL, $\sqrt{q(1-q)^2 + (1-q)q^2}$ for RMSE, and 0.5
for AUC.
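These expressions can be recovered with a short calculation, assuming that the best any predictor can do on i.i.d. outcomes is to report the constant probability $q$ for every attempt. Taking expectations over an outcome $x$ that equals 1 with probability $q$:

```latex
\begin{align*}
\mathbb{E}\bigl[x \log q + (1 - x)\log(1 - q)\bigr] &= q \log q + (1 - q)\log(1 - q), \\
\mathbb{E}\bigl[(q - x)^2\bigr] &= q(1 - q)^2 + (1 - q)q^2
  = q(1 - q)\bigl[(1 - q) + q\bigr] = q(1 - q), \\
\text{RMSE} &= \sqrt{q(1 - q)}.
\end{align*}
```

For $q = 0.5$ this gives an average LL of $\log 0.5 \approx -0.693$ and an RMSE of 0.5. A constant prediction ranks every pair of attempts as a tie, which gives the AUC of 0.5.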


Specifically, we generated i.i.d. results with sequence lengths
ranging from 1 to 100, with $q$ ranging from 0.1 to 0.9,
and with the same $\mathbf{s}$ for every attempt. For each length, we generated
10,000 sequences and computed the bound for average LL,
RMSE, and AUC using our method.
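A scaled-down version of this experiment for RMSE can be sketched as follows. Several details are assumptions of the sketch rather than a description of the experiment: the monotone fit is computed with pool-adjacent-violators instead of a numerical optimizer, squared errors are pooled over all simulated attempts, only 200 sequences are drawn per length, and the seed is arbitrary.

```python
import math
import random

def isotonic_fit(x):
    """Least-squares non-decreasing fit via pool-adjacent-violators."""
    blocks = []  # each block is [sum of values, number of values]
    for value in x:
        blocks.append([float(value), 1])
        while len(blocks) > 1 and (
            blocks[-2][0] * blocks[-1][1] >= blocks[-1][0] * blocks[-2][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    return [total / count for total, count in blocks for _ in range(count)]

def pooled_rmse_bound(q, length, num_sequences, rng):
    """RMSE of the monotone fit, pooled over simulated i.i.d. sequences."""
    squared_error, attempts = 0.0, 0
    for _ in range(num_sequences):
        x = [1 if rng.random() < q else 0 for _ in range(length)]
        p = isotonic_fit(x)
        squared_error += sum((pi - xi) ** 2 for pi, xi in zip(p, x))
        attempts += length
    return math.sqrt(squared_error / attempts)

rng = random.Random(0)  # fixed seed, an arbitrary choice
q = 0.5
theoretical = math.sqrt(q * (1 - q))
for length in (1, 10, 100):
    bound = pooled_rmse_bound(q, length, 200, rng)
    print(length, round(bound, 3), round(theoretical, 3))
```

With a sequence length of 1 the fitted probability equals the single observation, so the computed RMSE is exactly 0; for longer sequences it moves toward the theoretical value from below.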


We plotted the bounds computed by our method and the
theoretical bound in Figure 7. We chose not to plot the
results for $q$ from 0.1 to 0.4 in the figure since we found
that $q$ and $1 - q$ yield the same results. The solid curves
in the figure correspond to the results of our method for
each $q$, while the dashed lines correspond to the theoretical
bound for each $q$. As the figure shows, our method starts off
wildly optimistic when the sequence length is 1 and gradually
converges to the theoretical bounds as the sequence
length increases. At a sequence length of 100, the bounds
from our method are close to the theoretical bound for average
LL and RMSE but not for AUC. These trends suggest that our
method works reasonably well for average LL and RMSE
when the sequence length is large enough, but that it is too
optimistic on AUC even with long sequences.


6. DISCUSSION AND CONCLUSION 


In this paper, we presented a model-free bounding method 
to find the limit of the next-item-correct prediction task. 
The method assumes that forgetting is absent and uses the 
constraint that the probability of students correctly answer- 
ing a set of similar items should not decrease as they practice 
more. We applied our method to the Assistments 2009-2010 
dataset and found that DKT’s performance on this dataset 
is fairly close to the bound produced by our method. This 
suggests that the room for improvement on this dataset is 
small. 


The main shortcoming of our method is its optimistic na- 
ture. In other words, our method will produce a bound that 
is too loose, especially for short sequences. While we can 
conceive of many ways to potentially compensate for this op- 
timism (motivated by the scenarios discussed in Section 5), 
we fear that any attempts we make to estimate compensation
factors have the potential to yield a result that no longer
serves as a bound (i.e., that a real implementation could 
potentially achieve a performance exceeding our “bound”). 
Furthermore, we view the parameter-free simplicity of our 
method to be one of its virtues, and it is not clear how 
to preserve that while introducing such compensation. The 
other shortcoming is that our method does not incorporate 
forgetting by default. However, this could potentially be 
incorporated by relaxing constraints when forgetting is sus- 
pected to have occurred. 


The intuition behind our method is based on the reason why 
next-item-correct prediction is feasible. Since independent 
identically distributed (i.i.d.) coin tosses are inherently un- 
predictable, next-item-correct prediction is feasible only if 
there are regularities in the data. Learning is undoubtedly 
the most important regularity that we would like to observe 
in any educational system. Thus the difficulty of the next- 
item-correct prediction task depends on how much students’ 
performance deviates from i.i.d. and shows non-decreasing 
behavior. Our method tries to capture such regularities due 
to learning. 
APPENDIX


To prove that minimizing RMSE is equivalent to maximizing
average LL in the case of Equation 1, we first recall the
concept of a scoring rule [4], which is a function that scores a
predictive probability distribution $P$ against an observation
$x_i$ drawn from a target probability distribution $Q$ that we
are trying to recover. In this context a larger score indicates
a better $P$. In the case of binary variables with range $\{0, 1\}$,
both $P$ and $Q$ are Bernoulli distributions and a scoring rule
can be simply denoted as $S(p, x)$, where $p$ is the probability
of observing 1 in $P$ and $x$ is an observation drawn from $Q$.


A strictly proper scoring rule is a scoring rule such that the 
expected score over a set of observations drawn from $Q$ is
uniquely maximized when P = Q [4]. The quadratic score 
and the logarithmic score are two commonly used strictly 
proper scoring rules. In the case of Equation 1, maximizing 
the quadratic score is equivalent to minimizing RMSE and 
maximizing the logarithmic score is equivalent to maximiz- 
ing average LL. 


In the binary case, a strictly proper scoring rule $S(p, x)$ has
the Savage representation $S(p, x) = G(p) + G^*(p)(x - p)$,
where $G$ is strictly convex and $G^*$ is a subdifferential of
$G$ [4]. Define the cost function $F(p; x_1,\dots,x_n)$ by
$F(p) = \frac{1}{n}\sum_{i=1}^{n} S(p, x_i) = G(p) + G^*(p)(\bar{x} - p)$, where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.


LEMMA 1. $F(p)$ has a unique maximum at $p = \bar{x}$ and is
strictly quasiconcave, thus unimodal.


PROOF. First observe that $F(p) = G(p) + G^*(p)(\bar{x} - p) \le
G(\bar{x}) = F(\bar{x})$ by the definition of the subdifferential, with
equality if and only if $p = \bar{x}$. Thus $p = \bar{x}$ is the unique
maximum.


To establish quasiconcavity, we will show that for any $\alpha \in
(0, 1)$, $F(\alpha p + (1 - \alpha) q) > \min\{F(p), F(q)\}$. Let $r = \alpha p +
(1 - \alpha) q$ and, without loss of generality, assume $p < q$, so
either $p < r \le \bar{x}$ or $\bar{x} \le r < q$. In the first case:


$$
\begin{aligned}
F(r) - F(p) &= G(r) - G(p) + G^*(r)(\bar{x} - r) - G^*(p)(\bar{x} - p) \\
&> G^*(p)(r - p) + G^*(r)(\bar{x} - r) - G^*(p)(\bar{x} - p) \\
&= (G^*(r) - G^*(p))(\bar{x} - r) \\
&\ge 0.
\end{aligned}
$$


The last step is due to monotonicity of $G^*$, which states that
$(G^*(r) - G^*(p))(r - p) \ge 0$, and because $(\bar{x} - r)$ has the same
sign as $(r - p)$ we have $(G^*(r) - G^*(p))(\bar{x} - r) \ge 0$. This establishes
that $F(r) > F(p)$ in the first case. Similarly, $F(r) >
F(q)$ in the second case, thus $F(r) > \min\{F(p), F(q)\}$.


For any solution to Equation 1, we can partition $p_1,\dots,p_n$
into blocks (subsets) where each member of a block has equal
value and no two blocks share a value. Because Equation 1
requires monotonicity, each block must have consecutive
indices.


LEMMA 2. If $\mathcal{L}$ is a strictly proper scoring rule, then every
solution to Equation 1 consists of blocks of the form


$$
p_i = \dots = p_j = \overline{\{x_i,\dots,x_j\}} = \sum_{k=i}^{j} x_k / (j - i + 1).
$$


PROOF. Consider any block $p = p_i = \dots = p_j$ in a solution
to the optimization problem described by Equation 1
when $\mathcal{L}$ is a strictly proper scoring rule. Because blocks have
distinct values, $p$ is locally unconstrained and so Lemma 1
implies $p = \overline{\{x_i,\dots,x_j\}}$.


Algorithm 1
1: $i \leftarrow 1$
2: while $i \le n$ do
3:   find the largest $j$ with $i \le j \le n$ that minimizes $\overline{\{x_i,\dots,x_j\}}$
4:   $p_i,\dots,p_j \leftarrow \overline{\{x_i,\dots,x_j\}}$
5:   $i \leftarrow j + 1$
6: end while
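A direct Python transcription of Algorithm 1 might look like the sketch below. It uses 0-based indices, compares means by cross-multiplication to avoid floating-point ties, and assumes integer 0/1 responses; the function name is hypothetical.

```python
def algorithm_1(x):
    """Greedy block construction following Algorithm 1 (0-indexed).

    Repeatedly takes the longest run starting at i whose mean is smallest
    and assigns that mean to every position in the run.
    """
    n = len(x)
    p = [0.0] * n
    i = 0
    while i < n:
        best_j, best_sum, best_len = i, x[i], 1
        running = 0
        for j in range(i, n):
            running += x[j]
            length = j - i + 1
            # running/length <= best_sum/best_len, without division;
            # "<=" keeps the largest j among equal means (Line 3).
            if running * best_len <= best_sum * length:
                best_j, best_sum, best_len = j, running, length
        for k in range(i, best_j + 1):
            p[k] = best_sum / best_len
        i = best_j + 1
    return p

x = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]  # the sequence from Figure 1
print([round(v, 3) for v in algorithm_1(x)])
# [0.0, 0.0, 0.333, 0.333, 0.333, 0.667, 0.667, 0.667, 1.0, 1.0]
```

This transcription runs in quadratic time in the worst case, which is adequate for sequences of the lengths considered here.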


THEOREM 1. If $\mathcal{L}$ is a strictly proper scoring rule, then
Algorithm 1 gives the unique solution to Equation 1.


PROOF. Let $p^*_1,\dots,p^*_n$ be the output of Algorithm 1. Assume
that $p_1,\dots,p_n$ is a distinct solution to Equation 1. Let
$k$ be the first index for which $p_k \ne p^*_k$ and let $p^*_i,\dots,p^*_j$ be
the block with $i \le k \le j$.


If $p_k < p^*_k$, then monotonicity implies $k = i$. Let $\{p_k,\dots,p_\ell\}$
be the following block, so $p^*_k > p_k = \overline{\{x_k,\dots,x_\ell\}}$, which
contradicts Line 3 in Algorithm 1.


If $p_k > p^*_k$, then $p^*_k < p_k \le \overline{\{x_k,\dots,x_j\}}$ because the optimization
subproblems for blocks in $\{p_k,\dots,p_j\}$ are locally
unconstrained below. But by Lemma 2 we have:


$$
\begin{aligned}
p^*_k &= \overline{\{x_i,\dots,x_j\}} \\
&= \frac{k-i}{j-i+1}\,\overline{\{x_i,\dots,x_{k-1}\}} + \frac{j-k+1}{j-i+1}\,\overline{\{x_k,\dots,x_j\}} \\
&> \frac{k-i}{j-i+1}\,p^*_k + \frac{j-k+1}{j-i+1}\,p^*_k \\
&= p^*_k,
\end{aligned}
$$


which is again a contradiction.


Note that Algorithm 1 does not depend on $\mathcal{L}$, so all strictly
proper scoring rules give the same solution to Equation 1.